WIP: Merge Dev to Main by danielaskdd · Pull Request #2846 · HKUDS/LightRAG

danielaskdd · 2026-03-27T05:21:38Z

Dummy PR: Merge Dev to Main (Never try to merge this PR)

…size resolution

- introduce `_apply_chunk_size_overlay` to reconcile `chunk_token_size` and `chunk_overlap_token_size` across config tiers
- change `chunk_token_size` and `chunk_overlap_token_size` fields to `Optional[int]` with `None` default
- update `default_chunker_config` to only read strategy-specific env vars, leaving slots empty for overlay fallback
- add precedence chain: addon_params explicit > strategy env > legacy constructor field > legacy env
- back-fill legacy instance fields after resolution for backward compatibility with downstream readers
- update Chinese documentation to reflect new configuration hierarchy and priority rules
- add comprehensive tests covering constructor overlay, addon_params precedence, strategy env wins, and legacy fallback
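The four-tier precedence chain can be pictured as a first-non-`None` scan. The following is a minimal sketch under stated assumptions, not the body of `_apply_chunk_size_overlay`: the function name, its parameters, and the `default` placeholder are all hypothetical.

```python
from typing import Optional


def resolve_chunk_size(
    addon_value: Optional[int],
    strategy_env_value: Optional[int],
    constructor_value: Optional[int],
    legacy_env_value: Optional[int],
    default: int = 1200,  # placeholder value, assumed for illustration only
) -> int:
    """Return the first configured tier, scanning from highest to lowest priority."""
    tiers = (addon_value, strategy_env_value, constructor_value, legacy_env_value)
    for value in tiers:
        if value is not None:  # None marks a slot left empty for a lower tier
            return value
    return default


# addon_params is unset, so the strategy env value wins over the legacy tiers.
print(resolve_chunk_size(None, 800, 1000, 1500))
```

Using `None` rather than a numeric default for unset slots is what lets a lower tier fill in; a field that defaulted to a real number could not be told apart from an explicit setting.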

…h semantic strategy

- introduce CHUNK_P_SIZE env variable to decouple P strategy chunk size from global CHUNK_SIZE
- update default_chunker_config to parse and inject CHUNK_P_SIZE into paragraph_semantic options
- modify pipeline to extract and apply per-strategy chunk_token_size for P strategy with fallback to resolved top-level size
- document new env variable and configuration in Chinese docs with usage guidance
- add tests verifying env override behavior and fallback to global chunk size when unset

- add upper version bounds for langchain-text-splitters (<2) and langchain-experimental (<1)
- remove duplicate langchain 1.x and langchain-core 1.x entries from uv.lock
- add missing explicit dependencies (defusedxml, langchain-experimental, langchain-text-splitters) to api/evaluation/offline/test extras
- pin async-timeout to 4.0.3 for python < 3.11 to resolve version conflicts

…e file processing documentation

- reorganize document with numbered sections for server deployment workflow
- add quick start section with legacy, native, and combined configuration examples
- introduce detailed chunk_options configuration with environment variable reference
- add new chapter for python sdk usage covering runtime api and deprecated parameters
- improve clarity on engine fallback, validation, and priority chains
- relocate and expand storage layout, duplicate detection, concurrency, and resume rules sections
- add appendix for upgrade notes regarding deprecated multimodal global switch

…rategies

- introduce CHUNK_R_SIZE env variable for recursive character chunker
- introduce CHUNK_V_SIZE env variable for semantic vector chunker
- update env.example with new per-strategy size options and documentation
- modify pipeline to pop and apply strategy-specific chunk_token_size
- add tests for dedicated env override and fallback behavior for both R and V

- add _format_chunking_log helper to emit concise, scannable log lines
- alias long parameter keys to short forms for readability
- skip None and empty values to keep output compact
- log before each chunking strategy call (P, R, V, F, F(legacy))
- include chunk size, relevant params, and file path in every log line
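A compact log formatter of this kind might look like the sketch below. It is an illustration only: `format_chunking_line`, its signature, and the output layout are hypothetical, and the alias table holds just the three pairs named elsewhere in this PR's commit messages.

```python
from typing import Any, Mapping

# Alias pairs mentioned in this PR's commit messages; the real table may be larger.
KEY_ALIASES = {
    "breakpoint_threshold_type": "break",
    "buffer_size": "buf",
    "sentence_split_regex": "regex",
}


def format_chunking_line(strategy: str, doc_ref: str, params: Mapping[str, Any]) -> str:
    """Build one compact log line, skipping None and empty values."""
    parts = []
    for key, value in params.items():
        if value is None:
            continue  # unset option: omit entirely
        if hasattr(value, "__len__") and len(value) == 0:
            continue  # empty string/list/dict: omit as well
        parts.append(f"{KEY_ALIASES.get(key, key)}={value}")
    return f"chunking[{strategy}] {' '.join(parts)} src={doc_ref}"


print(format_chunking_line(
    "V", "a.pdf",
    {"chunk_token_size": 800, "buffer_size": 1, "sentence_split_regex": "", "separators": None},
))
```

Note that a value of `0` is kept, since only `None` and zero-length containers are treated as empty.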

…ptions

- document `CHUNK_R_SIZE` and `CHUNK_V_SIZE` environment variables
- add strategy-specific size fields to recursive_character and semantic_vector examples
- update priority chain to include new R and V size env variables
- clarify R size favors smaller targets for sentence splitting and V size acts as advisory ceiling

… to doc metadata

- rename _format_chunking_log to _format_chunking_params for reuse in both logging and metadata
- add chunk_opts_str to capture and persist actual chunker params to doc_status.metadata
- include chunk_opts in _DOC_STATUS_METADATA_CARRY_OVER_KEYS for visibility across status transitions

- P chunker: anchor-less branch falls back to recursive_character splitting so chunk_token_size is honored even when no eligible paragraph anchor is available (e.g. dense academic prose). Previously the block was emitted as a single oversized chunk and relied on the embedding-time hard fallback, which uses embedding_token_limit (not chunk_token_size) and cannot enforce the user-configured size.
- V chunker: extend default sentence_split_regex to recognize CJK sentence terminators (。？！) so SemanticChunker actually produces sentences on Chinese / mixed-language input instead of treating the whole document as one. Add post-split size enforcement via R for any piece exceeding chunk_token_size, since SemanticChunker has no native size cap.
- R chunker: extend default separators with CJK punctuation (。！？；，) so Chinese documents split at semantic boundaries instead of falling through to character-level splitting. English '.?!' is intentionally excluded: a literal match would split numerals (0.95) and abbreviations (e.g.).
- Expose CHUNK_V_SENTENCE_SPLIT_REGEX env var (alongside existing CHUNK_R_SEPARATORS) so users can customize per deployment.
- Move shared defaults (DEFAULT_R_SEPARATORS, DEFAULT_SENTENCE_SPLIT_REGEX) to constants.py as the single source of truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
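To make the CJK point concrete, here is a small sketch of a sentence splitter that treats 。？！ as terminators. The pattern is an assumption written for illustration; it is not the project's DEFAULT_SENTENCE_SPLIT_REGEX, and `split_sentences` is a hypothetical helper.

```python
import re

# Assumed pattern: split on whitespace that follows an ASCII terminator,
# or at the zero-width position right after a CJK terminator.
SENTENCE_SPLIT = re.compile(r"(?<=[.?!])\s+|(?<=[。？！])")


def split_sentences(text: str) -> list:
    """Split mixed Chinese/English text into sentences, dropping empty pieces."""
    return [piece.strip() for piece in SENTENCE_SPLIT.split(text) if piece.strip()]


mixed = "今天天气很好。我们出发吧！The score is 0.95 today. Done."
for sentence in split_sentences(mixed):
    print(sentence)

# A literal '.' separator, by contrast, cuts straight through a numeral:
print("0.95".split("."))
```

Requiring whitespace after the ASCII terminator is what keeps `0.95` intact here, which mirrors the reason the R chunker's literal separator list leaves English `.?!` out.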

…ment

Sync FileProcessingConfiguration-zh.md with the chunker fixes:

- §2.5 process options table: explain new R default cascade (CJK punctuation tier), V's CJK-aware sentence splitter and post-split R-based size enforcement, and P's anchor-less fallback to R.
- §3.2 env vars table: update CHUNK_R_SEPARATORS default, switch CHUNK_V_SIZE description from "advisory ceiling" to "hard cap", and document the new CHUNK_V_SENTENCE_SPLIT_REGEX env var.
- §3.4 chunk_options JSON example: reflect new R separators default and add semantic_vector.sentence_split_regex field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- merge AGENTS.md content into CLAUDE.md and remove duplicate file
- update project structure to reflect current module layout
- add workspace isolation details and pipeline concurrency contract
- include WebUI commands, testing scripts, and setup wizard outputs
- remove redundant sections and streamline common issues

… fallback

- detect table format (json / html / unknown) via explicit format= attribute, fall back to body sniffing when attrs are silent
- split json tables on top-level row items and html tables on <tr> boundaries; only when no row boundary is available, or a single row alone exceeds the cap, drop to character-level fallback
- apply the same table-aware fallback in stage C anchor-driven long-block re-split so non-table residuals are character-split while oversized tables retain row integrity
- tests cover detect / html row extraction / json splitting / combined dispatcher; existing _expand_block_with_table_splits paths unchanged
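The detect-then-split flow described above can be sketched as follows. Both function names and the sniffing heuristics are assumptions for illustration; the commit does not show the actual implementation.

```python
import json
import re


def detect_table_format(attrs: dict, body: str) -> str:
    """Prefer an explicit format attribute; otherwise sniff the body."""
    fmt = (attrs.get("format") or "").lower()
    if fmt in ("json", "html"):
        return fmt
    stripped = body.lstrip()
    if stripped.startswith(("[", "{")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            pass
    if re.search(r"<tr[\s>]", stripped, re.IGNORECASE):
        return "html"
    return "unknown"


def split_rows(fmt: str, body: str) -> list:
    """Return row strings, or an empty list when no row boundary is available."""
    if fmt == "json":
        data = json.loads(body)
        if isinstance(data, list):  # top-level items are treated as rows
            return [json.dumps(row, ensure_ascii=False) for row in data]
        return []
    if fmt == "html":
        return re.findall(r"<tr\b.*?</tr>", body, re.IGNORECASE | re.DOTALL)
    return []


html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(detect_table_format({}, html), split_rows("html", html))
```

An empty result from `split_rows` is the signal for the caller to drop to character-level splitting, as is a single row that alone exceeds the cap.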

- account for table wrapper overhead in row splitter budgets to prevent post-wrap overflows
- add recursive re-splitting for table chunks that still exceed target_max after wrapping
- debit newline separator tokens in no-anchor greedy packing to enforce target_max strictly
- add tests for separator token accounting and wrapper overhead budgeting
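Why separator tokens matter in greedy packing is easiest to see in a toy version. The sketch below assumes a whitespace word count as a stand-in tokenizer and a fixed cost of one token per newline separator; `pack_greedy` and its parameters are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer for illustration: whitespace-separated words.
    return len(text.split())


def pack_greedy(pieces, target_max, sep="\n", sep_tokens=1):
    """Pack pieces into chunks, charging sep_tokens for every join."""
    chunks, current, used = [], [], 0
    for piece in pieces:
        cost = count_tokens(piece)
        extra = cost + (sep_tokens if current else 0)  # separator only between pieces
        if current and used + extra > target_max:
            chunks.append(sep.join(current))
            current, used = [piece], cost
        else:
            current.append(piece)
            used += extra
    if current:
        chunks.append(sep.join(current))
    return chunks


pieces = ["a b c", "d e", "f"]
print(pack_greedy(pieces, target_max=6))                # separators charged
print(pack_greedy(pieces, target_max=6, sep_tokens=0))  # separators ignored
```

With separators ignored, all three pieces land in one chunk whose joined form costs more than the budget once the newlines are tokenized. A single piece larger than `target_max` is still emitted whole in this sketch; splitting it further is a separate step.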

- remove single-paragraph early return and recursive guard to allow character-level splitting of oversized single paragraphs
- re-measure joined content after separator tokens in tail absorption to prevent silent overflow
- disable chunk overlap in recursive character fallback to honor non-overlapping contract
- add regression tests for merge boundary checks, single-paragraph split, and fallback overlap behavior

… setup scripts

- add /app/data/prompts directory creation in dockerfile and dockerfile.lite
- add PROMPT_DIR environment variable and volume mounts in all compose files
- update setup scripts to support PROMPT_DIR configuration and idempotent mount injection
- fix redis test default uri to remove trailing slash

- consolidate verbose log strings in parse_mineru and parse_docling to reduce noise
- shorten analyze_multimodal opt-in missing and backfill log lines for clarity
- remove redundant file_path references from completion and cache hit logs
- update chinese documentation to match simplified log format

`@abstractmethod`

… methods

- remove default implementations of get_doc_by_file_basename and get_doc_by_content_hash
- add `@abstractmethod` decorator to enforce implementation in subclasses
- clean up unused asdict import from dataclasses module
- simplify docstrings to reflect abstract nature of methods
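The effect of marking the two lookups abstract can be shown with Python's standard `abc` module. In this sketch the class names, the `async` signatures, and the return types are assumptions; only the two method names come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Optional


class DocStatusStorageSketch(ABC):
    """Hypothetical base class; only the two method names come from the commit."""

    @abstractmethod
    async def get_doc_by_file_basename(self, basename: str) -> Optional[dict]:
        """Look up a document status record by file basename."""

    @abstractmethod
    async def get_doc_by_content_hash(self, content_hash: str) -> Optional[dict]:
        """Look up a document status record by content hash."""


class IncompleteBackend(DocStatusStorageSketch):
    # Implements only one of the two required methods.
    async def get_doc_by_file_basename(self, basename: str) -> Optional[dict]:
        return None


try:
    IncompleteBackend()
except TypeError as exc:
    print(f"rejected at instantiation: {exc}")
```

Without a default implementation to fall back on, a storage backend that forgets one lookup fails when it is constructed rather than silently returning a placeholder at query time.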

…load feedback

Backend:
- /health derives pipeline_active = busy || scanning || destructive_busy || pending_enqueues > 0
- Also exposes pipeline_scanning / pipeline_destructive_busy / pipeline_pending_enqueues
- Closes the gap where the scan classification phase set only `scanning` and the pipeline-busy button stayed grey for 5~10s

Frontend:
- Add activity probe: exponential-backoff /health bursts at t=0/1/2/4/8/16s fired by scan_started and the first successful upload in a batch. Exits as soon as both pipelineActive=true AND the document list has caught up.
- Add refreshDocumentsThrottled(): wall-clock 2s minimum between any two /documents/paginated requests, with trailing-call coalescing.
- Scan/upload no longer rely on resetHealthCheckTimerDelayed + ad hoc fast polling windows; probe + active polling cover both paths.
- Polling stays at 5s while pipelineActive=true even if doc list hasn't surfaced new rows yet, so the 30s idle gap right after scan disappears.
- Stale trailing refresh is dropped via latestRefreshRequestVersionRef check so 2s-window page/filter/sort changes can't be overwritten by a captured old query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
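The backend flag and the probe schedule can be modeled in a few lines. This is a simulation sketch in Python, not the actual backend handler or the WebUI code (which is not shown in this PR text); `pipeline_active`, `run_probe`, and the callback parameters are hypothetical, and no real sleeping or HTTP requests take place.

```python
def pipeline_active(busy, scanning, destructive_busy, pending_enqueues):
    """Mirror the derivation quoted in the commit message."""
    return bool(busy or scanning or destructive_busy or pending_enqueues > 0)


# Probe offsets in seconds, as listed in the commit message.
PROBE_OFFSETS_S = (0, 1, 2, 4, 8, 16)


def run_probe(is_active_at, docs_caught_up_at, offsets=PROBE_OFFSETS_S):
    """Return the first offset where both exit conditions hold, else None."""
    for offset in offsets:
        if is_active_at(offset) and docs_caught_up_at(offset):
            return offset
    return None


# Scan classification alone is now enough to report the pipeline as active.
print(pipeline_active(False, True, False, 0))

# Simulated timeline: active from t=2s, document list catches up at t=4s.
print(run_probe(lambda t: t >= 2, lambda t: t >= 4))
```

The probe exits only when both conditions hold, so a pipeline that reports active before the document list refreshes keeps being probed until the list catches up or the offsets run out.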

P (paragraph_semantic) chunking now uses DEFAULT_CHUNK_P_SIZE (2000) when CHUNK_P_SIZE env is unset, instead of silently inheriting the global CHUNK_SIZE / LightRAG(chunk_token_size=...). Paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together; inheriting the smaller global ceiling defeats the strategy's purpose.

Precedence (high → low): caller-supplied paragraph_semantic.chunk_token_size > CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE (2000)

The backfill lives in slim_chunk_options(), the single chokepoint shared by both enqueue paths (resolve_chunk_options + caller-supplied chunk_options=). _apply_chunk_size_overlay() carries a mirror backfill so direct addon_params introspection sees the resolved value too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
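The three-step precedence for the P strategy could be sketched like this. The function `resolve_p_chunk_size` and its `environ` parameter are hypothetical, and this is not the code in slim_chunk_options(); only the env variable name and the default of 2000 come from the commit message.

```python
import os

DEFAULT_CHUNK_P_SIZE = 2000  # value stated in the commit message


def resolve_p_chunk_size(caller_value=None, environ=None):
    """Caller-supplied value > CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE."""
    environ = os.environ if environ is None else environ
    if caller_value is not None:
        return caller_value
    raw = environ.get("CHUNK_P_SIZE", "").strip()
    if raw:
        return int(raw)  # a malformed value raises ValueError in this sketch
    return DEFAULT_CHUNK_P_SIZE


# The global CHUNK_SIZE is deliberately not consulted for the P strategy.
print(resolve_p_chunk_size(environ={"CHUNK_SIZE": "1200"}))
print(resolve_p_chunk_size(environ={"CHUNK_P_SIZE": "3000"}))
print(resolve_p_chunk_size(caller_value=1500, environ={"CHUNK_P_SIZE": "3000"}))
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.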

…d modalities

- remove idempotent skip logic for existing llm_analyze_result entries
- overwrite prior success/skipped/failure results on each run for enabled modalities
- allow retry after fixing vlm/extract configuration without manual sidecar cleanup
- rely on llm analysis cache to avoid redundant provider calls when inputs unchanged
- update docs and tests to reflect new non-idempotent overwrite behavior

… response logging

- add `raise_for_status_with_detail` and `response_error_detail` helpers to `_common.py`
- replace ad-hoc status checks in docling and mineru clients with unified helper
- include compact response body snippets in error messages for faster debugging
- add test coverage for HTTP error preservation and non-2xx handling in docling and mineru

…format

- add lightrag_load_errors collection to track blocks.jsonl read failures
- skip documents with unreadable blocks instead of creating false "{{LRdoc}}" entries
- flush failed stubs via apipeline_enqueue_error_documents inside critical section
- return track_id on failure-only batches instead of None to prevent silent archival
- expose file_size and original_error in failure records for better debugging

- add reference to FileProcessingPipeline.md documentation for parser setup
- change example LIGHTRAG_PARSER from commented to active with new default pattern
- update parser pattern to use native-teP and legacy-R fallbacks

…d paragraph semantic chunking documentation

- update both zh and en quick start sections with clearer legacy, recommended and multimodal scenarios
- replace mineru-centric examples with native-teP and legacy-R combinations
- add new comprehensive ParagraphSemanticChunking.md with full P strategy documentation
- remove outdated native-only docx examples and docling references
- align zh docs with en structure and terminology

danielaskdd temporarily deployed to pypi April 26, 2026 21:48 — with GitHub Actions Inactive

danielaskdd force-pushed the dev branch from 8ef8a29 to 5c2f738 Compare April 27, 2026 06:07

danielaskdd and others added 28 commits May 9, 2026 19:58

♻️ refactor(lightrag): integrate chunk size overlay into runtime addo…

3a0628f

…n params - ensure chunk size configuration is reconciled when runtime addon params are set - maintain consistency across all four configuration tiers

Merge pull request #3046 from danielaskdd/feat/add-R-V-chunker

50fda8e

feat(chunker): add R/V chunkers and chunk_options snapshot mechanism

Merge branch 'main' into dev

7fa7144

🔧 chore(env.example): reorganize configuration sections

ad19e9f

- move extraction-related settings below multimodal parsing section - uncomment CHUNK_P_SIZE to set default value of 3000 - improve logical grouping by placing docling settings before extraction configs

♻️ refactor(pipeline): simplify hard fallback split metadata

7f7c6e0

- replace three separate metadata fields with single compact string - keep same information in "pre -> post" format while reducing noise - signal split occurrence by field presence alone

♻️ refactor(pipeline): add chunk log key aliases for compression

dc1ab7a

- add "breakpoint" to "break" alias - add "buffer" to "buf" alias - add "sentence_split_regex" to "regex" alias

Merge pull request #3050 from danielaskdd/fix/chunker-cjk-support

a993e07

fix(chunker): CJK punctuation support and chunk_token_size enforcement

Merge branch 'main' into dev

8d542d9

🔧 chore(pipeline): clean up chunk log key aliases

10f7d03

- remove redundant alias mappings for "breakpoint" and "buffer" - shorten "breakpoint_threshold_type" alias from "breakpoint" to "break" - shorten "buffer_size" alias from "buffer" to "buf"

🐛 fix(pipeline): fix blocks_path parameter duplication in chunking

a40e774

- remove duplicate blocks_path from p_opts passed to _format_chunking_params - prevent potential key collision since blocks_path is already extracted separately

🔧 chore(config): update default chunk token size for paragraph semant…

d652da8

…ic chunker - reduce CHUNK_P_SIZE in env.example from 3000 to 2000 for consistency - update chunk_token_size default in paragraph_semantic.py from 1200 to 2000

📝 docs(agents): consolidate agent guidance into AGENTS.md

36400d6

- rename CLAUDE.md to AGENTS.md for generic AI agent usage - replace full CLAUDE.md content with reference to AGENTS.md - update .gitignore to use broader "AI Agent files" terminology

Merge branch 'dev' into feat/chunker-table-row-split

07e41bf

📝 docs(CLAUDE): add newline at end of file

ebfd4d4

- fix missing newline at end of file to follow POSIX standard

danielaskdd and others added 28 commits May 20, 2026 13:55

Merge pull request #3100 from HKUDS/claude/update-opensearch-storage-…

66f3a9a

…TNxo8 feat(opensearch): add basename and content_hash lookups for doc status

Fix lintings

cc102ac

🔧 chore(setup): remove redundant default slash in redis uri normaliza…

1e3283d

…tion - eliminate unnecessary `:-/` fallback in redis uri path capture - ensure exact path preservation from original uri during local service normalization

📝 docs(docs): update log message in file processing pipeline document…

f3dddb1

…ation - correct the info log message format for empty equations sidecar in analyze_multimodal

🔧 chore(gitignore): update ignore rules for prompts directory

c050cc4

- replace specific entity_type subdirectory with entire prompts directory - update comment to reflect user customized prompt directory purpose

🔧 chore(memgraph): comment out exposed memgraph ports in docker template

de2917e

- disable default memgraph port exposure for improved security in template - allow users to opt-in to port exposure via environment configuration if needed

♻️ refactor(pipeline): update chunking log messages to use doc_id

61df440

- replace file_path with doc_id in chunking log messages for better traceability - apply consistent logging format across all chunking strategies (P, R, V, F, legacy)

🔧 chore(gitignore): update prompts ignore pattern

2213b31

- change ignored path from entire prompts directory to specific entity_type subdirectory - add documentation for user-defined prompts folder purpose

📝 docs(env.example): update entity extraction json comments

ab892c8

- clarify default behavior when ENTITY_EXTRACTION_USE_JSON is unset - improve description of json output trade-offs with latency and reliability

Bump API version to 0294

6da3f9c

Merge pull request #3101 from danielaskdd/refact/pipeline-status-refresh

295babc

feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback

📝 docs: translate file processing pipeline docs to English

2db077a

Add English versions of the three Chinese-only docs: - FileProcessingPipeline.md - LightRAGSidecarFormat.md - ParserDebugCLI.md https://claude.ai/code/session_01PEf2XkGrpo79D43GVPWn3G

Merge remote-tracking branch 'upstream/claude/translate-docs-to-engli…

dd5b19d

…sh-NXghN' into dev

📝 docs(pipeline): remove deprecated appendix from file processing docs

e242596

- remove legacy upgrade appendix about deprecated global multimodal switch - keep both chinese and english documentation in sync

📝 docs(README): update news section with latest features

c0d3ebd

- add RagAnything merge announcement with MinerU/Docling support - document four new text chunking strategies - add role-specific LLM configuration details

Merge pull request #3102 from danielaskdd/feat/p-chunk-size-dedicated…

33f3366

…-default feat(chunker): give P strategy a dedicated default chunk_token_size

Bump API version to 0295

9aaf7b8

📝 docs(FileProcessingPipeline): fix env file reference in documentation

a997100

danielaskdd merged commit b62c260 into main May 21, 2026
4 of 5 checks passed

danielaskdd deleted the dev branch May 25, 2026 04:40

danielaskdd merged 622 commits into main from dev

3 participants
